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I,  Bo  PL  E2qperim.en.ts  with  Accelerated 
Gradient  Methods  for  Linear  Equations  * 
by 

Ao  I«  Forsythe 
and 

Go  Ee  Forsythe 

National  Bureau  of  Standards^,  Los  Angeles 
I,fl  SUMMARY 

Various  gradient  (steepest  descent)  methods  for  solving  systems 
of  linear  equations  have  been  discussed  by  Cauchy  [ 2] ,,  Temple  [12]5 
Kantorovich  [Sls  and  others 0 The.  method  usually  discussed,,  the 
optimum  gradient  method  (erxplained  in  section  II)5  ordinarily 
converges  too  slowly  for  practical  use„  Under  the  general  leader- 
ship of  Professor  Magnus  Hestenes  at  the  Institute  for  Numerical 
Analysis  several  methods  have  been  studied  for  speeding  up  the 
gradient  method® 

A class  of  modified  gradient  methods^,  in  which  one  overshoots 
or  undershoots  the  optimum  pointy  is  presented  in  [7]®  In  [11]  Stein 
presents  numerical  experiments  with  the  matrix  used  below,,  show- 
ing that  consistently  undershooting  (“"almost  optimum”’  gradient- 
method)  provides  a self- accelerating  procedure®  Motzkin  and  one 
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of  us  propose  [5]  an  acceleration  step  to  be  inserted,  occasionally 
into  the  optimum  gradient  method.  (In  section  II  we  give  He-stenes  s 
interpretation  of  this  device  as  a minimization  in  two  dimens ions 0) 

The  purpose  of  the  I.  B.  Mc  experiments  now  reported  was  to  test 
the  latter  acceleration  procedure.  Incidental  to  this*  we  obtained 
additional  data  on  the  optimum,  almost  optimum,  and  other  gradient 
methods  . 

A survey  of  the  formulas  used  is  given  in  section  II?  and  the 
numerical  experiments  are  summarised  in  section  III.  In  section  IV 
we  study  these  data  in  some  detail*  Section  V contains  the  references 
referred  to  in  the  text  by  numbers  in  square  brackets. 

In  brief,  it  is  our  conclusion  that  for  two  test  matrices  of 
order  six,  the  acceleration  speeds  the  optimum  gradient  method  up 
by  a factor  of  from  7 to  18,  and  makes  the  optimum  method  possibly 
useful.  The  almost  optimum  gradient  method  is  something  like  half 
as  fast  as  our  accelerated  procedure  (on  the  basis  ^ two  test 
matrices)  but  ~ and  this  is  very  important  for  machine  work  =■  the 
almost  optimum  gradient  method  is  simpler  to  code.  The  more 
recently  developed  methods  of  Hestenes  and  Stiefel  [8]  now  appear 
to  offer  much  faster  convergence  at  a modest,  increase  in  complexity. 

II.  SUMMARY  OF  THE  THEORY 

For  simplicity  we  deal  with  the  field  of  real  numbers.  Let 
A be  an  n-by-n  matrix,  not  singular,  and  let  x,  b denote  n-rowed 
column  vectors.  We  are  interested  in  finding  the  solution  A '%  of 
the  system 
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(1)  Ax  ® b « 

Let  T denote  transposition  of  a matrix,,  The  positive  definite 
T T 

matrix  B - i A and  the  vector  c s A b will  frequently  be  used*  The 

2 T 

length  jyj  of  a column  vector  y will  be  defined  by  |y!~  “ y y#  We 
use  © to  denote  the  zero  vector* 

Let  f(x)  s jAx  - measure  the  deviation  of  any  vector  x 
from  the  solution  A \>0 
One  can  verify  that 

(2)  f(x)  s x^Bx  - 2x^0  + jbj"  « 

Suppose  x is  a given  approximation  to  A ^b^,  and  let  d be  a 
given  direction#  As  an  improvement  of  x we  may  select  the  vector 
y(ot)  ~ x “t>£d  for  which  f[y(®0]  assumes  its  minimum  as  a function 
of  the  real  variable  ©i  „ The  corresponding  value  of  °(  will  be  called^-# 

To  obtain  a formula  for^f  9 we  first  find  from  (2)  that 

(3)  f [y(=<  )]  - f(x)  + ^2dTBd  - 2<dT(Bx  - c)  * 

Introducing  the  abbreviation 

(U)  S s Bx  “ o j 

we  find  from  (3)  that 

(5)  y ^ dM  jf/d^Bd  * 


In  the  optimum  gradient  method  for  solving  (1)9  suggested  by 


k 


Cauchy  [2]  and  analyzed  by  Temple  [ 12 ] ^ Kantorovich  [9] 5 Hestenes 
and  Karush  [6]  (for  the  eigenvalue  problem )5  and  others^,  one  selects 
any  and  then  obtains  each  x^.+1  from  as  follows s For  each 
one  picks  d^  to  be  •§•  grad  f(x^)  s - c ® If,  0 and  takes; 


yk  K>  h - b3 


where a by  (5)5 


y.  - KT 

k k k k k 


Kantorovich  showed  on.  ppffl  Ibh*,  15U  of  [9]  that  in  the  optimum 
gradient  method 


(8) 


f(x,. 


a. 


max 


f(xk-X) 


n 


n 


n X 1 


where  A and  A are,,  respectively^,  the  largest  and  least  of  the 
(necessarily  positive)  eigenvalues  of  B®  It  follows  that 


(9)  jibe.  - bj  ^ |AXq  - b||ik  -4  0?  as  k — # 00  5 

so  that  the  method  converges®  Our  experience  suggests  that  the 
inequality  in  (9)  is  usually  nearly  an  equality j see  section  T¥« 
Since  ji  is  commonly  near‘d  1,s  the  optimum  gradient  method  is 

"^If  all  the  elements  of  A have  the  same  normal  distribution,  it 

results  from  p.  59  of  [1]  that  the  probable 88  value  of  ji  is 
=2 

81  about 88  1 ~ Ijm  (The  precise  meaning  of  this  is  not  stated  in 

[1].) 
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usually  too  slowly  convergent  for  practical  useQ  In  section  III 
we  give  examples  of  the  optimum  gradient  methods 

Mary  proposals  have  been  made  to  speed  up  the  process 0 In 
[7]  Hestenes  and  Stein  describe  a family  of  modified  gradient 
methods  in  which  one  changes  formula  (6)  to  read 


(10) 


P 1 f k » t 


Bx  ~ c 


k s ^ k :Xk 

where  ^ Is  a fixed  factor  in  the  range  0 < (3  < 2 x.  and.  prove  the 

convergence®  For  near  0*9  (called  the  *®almost  optimum”  gradient 
method)  the  evidence  in  Stein  [1,1]  suggests  that  the  convergence 
is  much  faster  than  for  the  optimum  gradient  method  ( P ^ 1)*  In 
section  III  we  summarize  these  data5  and  give  more  of  our  own® 

Other  proposed  accelerations  of  the  gradient  method  involve 
getting  ky  minimizing  f(x)  in  the  p-dimensional  linear  subspace 


x ^ x f.  f, 

K”p  1 k~p  2 k“p 


* B^"1  T.  (1  < p ^ n|  ^ . 

p krp  * i 


real  and  arbitrary )»  (This  is  equivalent  to  minimizing  f(x)  in  the. 
linear  p=  space  containing  ° 00?  an^  ^ ©)  Kantorovich 

[9]  suggests  use  of  p s 2®  Karush  [10]  considers  the  analogous 
process  -with  a general  p in  solving  the  eigenvalue  problem®  In. 

[8]  Hestenes  and  Stiefel  give  an  iterative  method  which  effectively 
can  give  p any  value  up  to  ne  (When  p - n the  method  is  an  exact 
solution  of  (l) e)  Motzkin  and  one  of  us  propose  [5]  an  acceleration 
step  which  Professor  Hestenes  has  shown  to  be  equivalent  to  taking 
p ” 2®  We  now  describe  this® 

It  is  a conjecture  (stated  in  [£]|  proved  for  n ® 3 in  [17] f 


seemingly  confirmed  in  the  present  experiments)  that 
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f in  the  optimum  gradient  method  the  error  vector 
A ^“b  is  asymptotically  a linear  combination 


(11)  { 


of  the  eigenvectors  Up  u of  B belonging  to  the 
.largest  Q ) and  least  ( '},  eigenvalues  of  Be 

(If  there  are  eigenvectors  of  B orthogonal  to  ~ A one  dis- 

v. 

regards  the  corresponding  eigenvalues  in  determining/^  an^^n°^ 

When  this  asymptotic  relationship  holds  for  a given  x^  the  sequence 
behaves  asymptotically  as  though  it  were  in  the  2-plane 
Tf  containing  and.  u <>  But 

/ if  one  cairies  out  the  optimum  gradient  process  in 
I 2-plane  ti4  9 the  vectors  - A "'b  alternate 

(12)  \ between  two  directions  in  tip1  «.  and<,  for  each  k9  the 
line  joining  x,  „ and  x,  passes  through  A “Sd» 

It  is  therefore  the  proposal  of  [5]  that  the  optimum  gradient  method 
occasionally  be  interrupted  by  determining  the  x^.  + (1  “°^)xk„2 

which  minimizes  f[x(®^  )]*  IIs  the  acceleration  procedure  occurs 

after  m steps9  the  computing  procedure  is  to  set  x - x^  and 

d=x  « - x in  (10  and  (£)«  and  the  acceleration  formulas  ares 
m-2  m 9 


V_ 


d ® x - x 
m m*=2  ®. 


m 


Bx  - c 

m 


A,  " dJ  Vd-T  » d~* 


m 


m 


m ' 


Vi 


S x 


m 


- t d 


m an 


(13) 
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It  is  Professor  Hestenes * observation  that 
r the  of  any  acceleration  step  is  the  vector  in 

(1) | ) the  two-dimensional  subspace  x(<=<  ^ °(^)  ® x^^ 

” ' i ....  ? ” B ^'k~2  minimizes  f [x(°^  ^ ^ 2)] 

^ ^ith  respect  to  °<  1 and  <<  « 

Since,  in  using  the  optimum  gradient  method  numerically.,  one  is 
already  set  up  to  carry  out  the  types  of  operations  involved^, 
the  procedure  (13)  may  be  preferable  to  a more  direct  method  for 
carrying  out  the  above  two-dimensional  minimization „ (The  exten- 
sion of  this  idea  to  the  use  of  n successive  one-dimensional  steps 
to  minimize  f(x)  in  n dimensions  is  at  the  basis  of  the  algorithm 
in  [8].)  In  section  III  will  be  found  reports  of  numerical  experi- 
ments with  the  acceleration  step®  Various  numbers  of  optimum 
gradient  steps  have  been  tried  between  accelerations © 

We  also  report  insertion  of  the  acceleration  step  (13)  into 
the  modified  gradient  method  (10)  for  $ - Id®  In  this  case 
Professor  Hestenes ! interpretation  does  not  hold* 

Of  the  various  statements  made  above5  the  only  ones  which 
require  proof  are  (12)  and  (lli)* 

To  prove  (12)  we  may  assume  without  loss  of  generality  that 
b - ©o  It  then  suffices  to  show  that  is  parallel  to  x^®  The 

loci  f(x)  ~ constant  are  similar  ellipses  in  if  9 Let  t be  the 
tangent  at  x^  to  the  ellipse  through  :x^  e Then  the  gradient  at 
Xq  is  orthogonal  to  t^*  Since  f ^ is  also  the  tangent  at  x^ 
to  the  ellipse  through  Xp  is  also  orthogonal  to  the  gradient 

at  x^.  Being  both  orthogonal  to  the  lines  f ^ and  are  parallel, 
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Since  f ^ is  a tangent  at  x,?  to  the  ellipse  through  x^  is  parallel 
to  Xq5  as  was  to  be  proved,. 

To  prove  1 hs  we  note  that  (12)  implies  that  the  acceleration  step 
will  locate  the  common  center  of  the  ellipses  formed  by  the  intersection 
of  the  surfaces  f(x)  = constant  with  the  plane  through  x^ll)  and 

x^„  It  therefore  suffices  to  prove  that  this  plane  is  actually  the 

A 

plan^  k“2  k™2  (°<  . arbitrary)  , But  f ® ” c 


=2  “ ^k”2:i 


^ k“2  ^k“2  B ^k“2e  Then 


y * x f 

\ Vl  k=l  k - 


xj^2  “ ^ ^ k==2  + ^k-l)  t k~2  ” -2  B ^k™2°  Hence  the  normcollineai'3 

points  x^>  ^ and  x^  are  all.  in  the  plane  of  ^k  ° ~ ^ B ^k“2,5> 

and  (ll|)  is  proved  a 

When  the  conjectured  asymptotic  behavior  (11)  occurs  for  a given  x^ 

“1- 

we  saw  above  that  x^  “ A b asymptotically  approaches  0 while  alternating 
between  two  directions  Ls  L°  in  the  plane  Here  L°  are  related  by 
the  fact  that  their  conjugate"  directions  with  respect  to  the  ellipses 
f(x)  s constant,  in'W'  are  orthogonal,,  When  n m 3 the  proof  in  [I|]  of  (11) 
shows  that,f  when  Xq  has  a projection  on  each  eigenvector  of  B5.  not  all 
pairs  liS  L8  are  eligible  to  be  asymptotic  directions  of  - A \>«>  Roughly 
speaking,,  the  eligible  directions  Ls  L1  are  those  for  which  f(x^)/  f 
(which  has  one.  value  for  both  L and  L8)  is  sufficiently  near  its  maximum 


value 


see 


* Two  directions  are  conjugate  with  respect  to  an  ellipse  if  they  are 
the  directions  of  a radius  (from  the  center)  and  a tangent  at  the  same 
point. 
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For  each  direction  L of  x - A there  is  a ratio  f ej°(L) 
of  f(x)  to  \x  - A here  jP(L)  is  commonly  unequal  to  f (L5)® 
We  see  that 


(15)  f (L)  - - (x-^^x-A-1!) 

|x  - Av  |x  - A no  I" 

Thus  f (L)  is  the  Rayleigh  quotient  of  the  error  vector  x - A ho* 
We  therefore  have 


(16) 


'h  4 ^(L)  4 Xn  s 


see  p0  26  of  [3ls  for  example e From,  the  foregoing  it  follows  that5 
whenever  the  conjectured  asymptotic  behavior  (11)  of  x^.  holds*, 
jx^.  - A b j goes  to  zero  in  such  a manner  that  the  ratio 
|x^  ~ A \|/Sx^  ” A \>  \ will  alternately  approach  the  two  limits 
p f (L)ff  (L°)5  ja  f (LO/f  (L)o  Only  for  the  special  case  when 
^P(L)  ® f (L»)  (meaning  that  Ls  L8  have  symmetric  positions  with 
respect  to  the  major  axis  of  the  ellipses)  will  these  limits  be 
equal e 


It  is  instructive  to  study  the  modified  gradient  process 
(10)  analytically  in  two  dimensions  „ Some  of  the  characteristic 
behavior  of  the  runs  reported  in  section  111  occurs  ~ the  instability 
of  f(x^)/f(x^,a^)  for  f?  slightly  less  than  ls  and  the  approach  of 
f(\)/f(: \-i> t0  f2  for  most  :x.q  when  1 < ^ ^ Q < 2e  This  study 
has  been  stalled  by  one  of  us  with  Motzkin  [ unpublished] s and 
will  not  be  reproduced  here® 
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IIIo  EXPERIMENTS  AND  TABLES 

The  several  methods  described  in  section  II  were  tried  out 
with  two  essentially  different  systems  of  order  six  on  the  IBM 
Card-Programmed  Calculator  of  the  Institute  for  Numerical  Analysis, 
The  order  six  is  the  largest  for  which  an  ordinary  gradient  step 
could  be  handled  with  the  internal  storage  of  the  machine  usedf, 
for  these  exploratory  experiments  it  was  thought  better  to  spend 
as  little  time  as  possible  using  external  storage c (Even  so, 
our  acceleration  steps  required  external  storage ,)  The  coefficients 
of  the  first  system, Ax  ” b,  were  obtained  from  a table  of  random 
digits  simulating  a population  of  equally  distributed  integers 


99,  -98, 

0, 

®e©,  + 99 

• Me 

obtained 

' -lU 

.95 

61 

Uo 

3 

hi  ‘ 

2? 

-3U 

17 

”78 

39 

A ® 

13 

92 

”63 

26 

15 

”86 

2> 

“23 

86 

30 

95 

»80 

”76 

12 

52 

17 

61 

-3U 

h2 

_ -70 

-6U 

kZ 

U7 

23 

28 

b = 

[-93, 

-96, 

”71, 

26s 

-69, 

71  fo 

In  the  methods  reported  here  A and  b do  not  explicitly  enter  the 

calculations,  but  only  B ® A ‘A  and  c ® A b.  For  the  above  matrix; 

the  latter  are  given  a subscript  0$  for  scaling  purposes  B^  was 
then  multiplied  by  10-  , and  by  10  0 


B 


0 
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.06667 

<,02631+ 

™ .01+61+0 

“.07368 

-.02131 

“.00)431° 

.02631+ 

.26810. 

“ .0221+3 

ol5952 

-.05923 

-.12797 

- .01+61+0 

”.022l+3 

.10932 

.05150 

“ .01+100 

.08558 

“.07368 

.15952 

.05150 

.25152 

-.0XH+X 

".07169 

-•  .02131 

”.05923 

- .01+100 

-.0111+1 

.11+1+03 

.01105 

- .001+31 

“.12797 

.08558 

“.07169 

.01105 

.191+50 

- .0086095 

=.011+2795 

-.00021+35 

+.001+5765 

4 .00801+35 

-.001+895] 

Our  first  experiments  were  carried  out  with  B 9 c^9  in  order  to 

study  the  machine  procedure  on  a matrix  without  zeros®  These  B^ 

and  cQ  were  also  used  by  Stein  [XI]  5 who  calls  them  A and  bc 

The  gradient  methods  are  invariant  under  translations  and 

rotations  of  the  space 5 as  long  as  the  operations  are  carried  out 

exactly,,  The  machine  operations  are  much  faster  with  a diagonal 

matrix,  and  the  behavior  of  x^  is  more  easily  studied  in  the 

coordinate  system  of  the  eigenvectors  of  BQ.  Accordingly  the  positive 

T 

definite  matrix  B^  was  reduced  to  its  diagonal,  form  B^  s S S 9 
where  S is  an  orthogonal  matrix®  At  the  same  time  was  replaced 
by  - 0<>  the  zero^veetor*  so  that  the  solution  became  0O 
(Probably  in  practice  the  round-off  errors  with  Bp  are  less 
than  with  but  these  errors  are  not  studied  here.) 

We  h,ave 

*Mr«,  H.  M.  Hayes  determined  and  S on  the  C ard-P ro  gramme d Calculator. 


o0026870U 
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e 015 81310 
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o 17590130 
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e259U6632  (-. 
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eU9823U36 


ci'  t0* 


o. 


0, 


,T 

Q]  o 


The  ratio  P - ® 200  of  the  eigenvalues  of  EL  and  B_ 

of  0 1 

was  noted  in  section  II  to  be  intimately  related  to  the  speed  of 
convergence  of  the  optimum  gradient  method j indeed^  from  (8). 

p « (P  - 1)  (P  + If1  4 1-  2P“2 . 

To  get  data  from  a system  for  which  convergence  was  likely  to  be 
faster^  we  selected  a diagonal  matrix  with  P ® 36*  The  other 
eigenvalues  1 ^ were  selected  at  random  from  tables  simulating 
a,  rectangular  distribution  of  -log  A ^ on  the  interval  log  A 
£ log  ^ log  A^0  We  obtained 


Under  the  hypothesis  of  our  foot  not  es  page  k above5  the  ,sprobabl 
value  of  P is  ,%bout,9!  n~ a 
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For  each  matrix  we  used  various  modifications  of  the  gradient 
method^,  and  began  with  various  initial  vectors  Xq<>  A summary  of 
the  experiments  run  is  contained  in  Table  2o  On  the  Card" Programmed 
Calculator  we  used  a ten-digit  board  with,  fixed  decimal  pointy  designed 
by  Dr-„  Everett  C„  Yowell  of  the  Institute  for  'Numerical  Analysis <, 

Under  the  supervision  of  one.  of  us  the  experiments  were  run  by 
Messrs o Thomas  Do  Lakln^  William  0o  Paine j,  Jr 0?  and  Albert  HG  Rosenthal® 
The  data  were  checked  in  two  ways  g (i)  the  jf(x^)|  were 
scanned  for  reasonableness!  (ii)  the  experiments  with  matrices  B^ 
and  were  run  twice®  Since  check  (ii)  seldom  indicated  an  error 
which  had  not  been  suspected  from  check  (i)5  it  was  decided  to  get 
extra  data  for  matrix  Bn  by  omitting  check  (ii)„  These  data  therefore 
have  a higher  probability  of  error  than  those  for  matrices  B^  and 
B.p,  but  we  hope  that  they  are  essentially  correct «, 

In  Table  1 we  show  the  detailed  progress  of  one  run®  The 
matrix  and  initial  vector  are  the  same  (in  a different  coordinate 


system)  as  those  reported  by  Stein  [11] p and  the  table  may  be 
compared  with  his  Table  I0  The  run  chosen  consists  of  8 optimum 
gradient  steps  ( ^ ® 160)5  alternated  with  one  acceleration®  Since 
ft  s ieo,  it  results  from  (li|)  that  this  is  equivalent  to  6 minimizar 
iions  of  f[x(  °k)]  0n  the  line  x s ^ ^ K s>  followed  by  one 

minimization  of  f[x(^  ^ ^ ,,,)]  in  the  2-plane  x “ x^.  - ^ ® 

and  repeat  o If  we  used  a machine  with  sufficient  internal  storage^, 
the  amount  of  effort  in  each  cycle  would  be  reasonably  equivalent  to 
9 of  the  optimum  gradient  steps® 


"The  run  is  identified  by  an  arrow  in  Table  2® 


TABLE  1 


One  of  Our  Accelerated  Runs 


k 

lO^f^x  j J i (k~l<,k)  “ 

EAV/lvxk-r 

0 

.33360  3 

1 

11925  6 

o35?5 

2 

8538  ' 1 

.7159 

3 

6999  2 

.8198 

h 

6230  k 

.8902 

5 

9779  8 

.9 277 

6 

5U90  5 

«9U99 

7 

5263  8 

.95 87 

8 

5075  5 

0 9 61x2 

9 

iil5o  5 

.8178 

10 

3U3i  k 

08267 

11 

33iil  o 

.9737 

12 

3258  1 

a9752i 

13 

3179  6 

,9759 

Hi 

3103  7 

.9761 

3030  1 

o97  63 

16 

2958  3 

,9163 

17 

2888  3 

,9163 

18 

178  10 

0O617 

19 

62  250 

.31x95 

20 

36  $W± 

•5871 

21 

2?  ia.8 

•7503 

22 

23  81x5 

.8697 

23 

21  U57 

.8999 

2li 

19  557 

o911ix 

25 

17  950 

.9178 

26 

16  551 

.9221 

27 

7 lUoU 

.lx3lU 

28 

6 0682 

.81x98 

29 

5 7512 

.9878 

30 

5 5369 

.9627 

31 

5 3678 

.9695 

32 

5 2 U65 

.977U 

33 

5 1293 

.9777 

3k 

5 011x8 

.9777 

35 

lx  9030 

.9777 

36 

5055  7 

.0103 

37 

3085  3 

.6103 

38 

2230  8 

.7230 

39 

1736  2 

•7783 

Uo 

11x1x5  8 

.8327 

lii 

1267  0 

.8763 

1x2 

mix  1 

.8793 

h3 

980  01 

.8796 

ij* 

.00000  00  862  25 

.8798 

k 

- 102f(Xk)  =W 



:b5 

,00000  00011  072 

0O128 

46 

35833 

00321+ 

1+7 

30050 

<,8386 

1+8 

2831)9 

o9li3i+ 

1+9 

26827 

o9U63 

5 0 

25UU9 

0 9 U86 

51 

21+199 

o9509 

52 

23060 

o9529 

53 

22021 

0 95U9 

5k 

11206 

„5089 

55 

5767  7 

o5U+7 

56 

538?  k 

,9332 

57 

510U  0 

*9b83 

58 

1+880  1 

0 95 61 

59 

1+690  0 

,9 610 

6° 

1+520  2 

,9638 

6l 

1+365  5 

a 965  8 

62 

1+222  2 

,9672 

63 

21+20  9 

,5731+ 

61+ 

1956  0 

08080 

65 

1779  9 

.9100 

66 

1681+  h 

0 91+63 

6? 

1601  h 

,9507 

68 

152?  8 

,951+0 

69 

11+59  0 

,9550 

70 

139h  6 

,9559 

71 

| 133k  0 

,9565 

72 

31+3  23 

,2573 

73 

88  906 

,2590 

?i+ 

82  1+80 

,9277 

75 

80  370 

,971+1+ 

76 

78  1+02 

,9755 

77 

76  509 

,9758 

78 

7h  670 

o9?60 

79 

72  880 

,9760 

80 

L 00000  00000  00071  138 

1 

.9761 

— 

81 

o 00000  00000  00007  1757 

0I009 

82 

3 1i982 

,5875 

83 

2 3839 

,6815 

8it 

1 9269 

,8083 

85' 

1 700li 

,8825 

86 

1 5681 

,9222 

87 

1 U5  26 

,9263 

88 

1 35o6 

,9298 

8? 

l 2597 

,9327 

90 

52U99 

o5l6S 

91 

l|ii791 

,8532 

92 

51313 

,9225 

93 

39336 

0 95 21 

9b 

37650 

,9571 

95 

36157 

,9603 

96 

35726 

,9605 

97 

33355 

,9605 

98 

32038 

,9605 

99 

1068  l 

,0333 

100 

36  6?5 

,0353 

101 

32  875 

,8965 

102 

32  156 

o9?82 

103 

31  558 

,9783 

lOli 

30  775 

,9783 

io5 

30  108 

,9783 

106 

29  555 

,9783 

107 

28  81? 

o9783 

108 

35263 

.on? 

109 

12630 

.3686 

no 

9811  6 

,7768 

in 

7881  h 

,8036 

112 

6355  9 

,8057 

113 

5in  7 

.8056 

nli 

5118  6 

.8057 

ii5 

3318  7 

.8058 

n6 

2675  3 

.8058 

117 

5 366o 

.0016 

1 118 

0 00000  00000  00000  00000  00002  8m 

.6539 
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This  type  of  run  showed  the  fastest  convergence  of  f(x^)  to  0|  de- 
tailed comparisons  with  other  runs  will  be  found  in  Table  2 and  section 
IV  below. 

In  Table  1 each  heavy  horizontal  rule  indicates  an  acceleration 
step.  The  small  discrepancies  from  the  data  in  [ 11]  result  from 
round- off  errors  in  rotating.,  the  coordinate  system. 

To  save  space  and  bring  out  the  important  features  of  the  data, 
the  other  runs  are  presented  here  only  in  summary  form  (Table  2). 

The  columns  of  that  table  are  now  described. 

Column  Is  Here  we  give  the  matrix:  B s J^'L9  and  vector  c = JlTdo, 


& 

A 


Column  2.s  Here  we  give  the  initial  column  vector 
following  abbreviations  s 
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Me  use  the 


e 

« 

( o. 

0, 

o. 

0, 

0, 

0 

)x 

(i) 

x0 

s 

( .1, 

ol. 

ol,£ 

ol. 

«i. 

)T 

x (2) 

s 

(-.1, 

» 1 
oJ~S> 

el^ 

°1'5) 

ol. 

o 1 

)T 

X (3) 

xo 

s 

( .U595, 

. 0790 , 

.1200, 

.0880c, 

.0150, 

.0113 )T 

X (U) 

xo 

s 

( ,oo5. 

o005. 

•oo5. 

*00^, 

.01 

)T 

x & 
xn 

s 

( .01, 

•01, 

.01, 

© 01^ 

.01, 

.01 

)T 

vO 
» — » 

> o 
X 

St 

( .1, 

o.l\S> 

elf, 

© 

el. 

ol 

)T 

X t7) 
x0 

* 

( .05, 

»o5. 

<>05, 

.05, 

.05, 

? 

O* 

CO 

s 

C ° 

®o5. 

.05, 

°o  5, 

.05 

)T 
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Most  of  these  vectors  were  chosen  without  ar y special  significance, 
to  see  how  the  methods  varied  with  different  startso  The  vector  © is 
a reasonable  start  with  the  non-homogeneous  problem  with  cq®  The 


vector  x, 


0 


jn  the  coordinate  system  of  Bl5,  © is  identical  with 


run  is  an  iteration  of  the 


© in  the  coordinate  system  of  Bq5  cq< 

Column  3 g ft  is  defined  in 
Column  Us  For  all  ft s as 
step  in  formula  (10)„ 

For  2,  the  term  88m  steps  and  accelerate ,8'  means  that  m 
steps  of  type  (10),  resulting  in  xQ,  -o,  x^,  are 

alternated  with  a minimization  of  £(x)  for  x along  the  line  joining 


x _ and  x « according  to 
ra-2  nr 

For  ft-  2 a special  acceleration  was  performed  % After  getting 
the  points  Xq,  x~s  Xg,  one  optimum  gradient  (ft  ^ 1)  step 

was  taken  from  the  point  x s \ (xu,  + Xg)s  Then  8 more  steps  follow- 
ed  with  ft  " 2 j etc® 

Column  5>  g In  getting  the  total  numbe^  called  ss  of  steps  of 
a run,  each  acceleration  is  counted  as  one  step® 

Column  6g  The  number  r(lu  5 k0)  measures  the  mean  proportion- 
ate reduction  in  £(x,.)  per  step,  for  k between  k^  and  k^9  where 


It  Is  defined  by  the  relation 


r(k_,  k0)  * 


¥e 


may  interpret  r(lc  , k^)  as  the  geometric  mean  of  the  k , - k^ 


ratios  | f (x^)/f  (x^ 


.i 


+ 1,  K + 2,  ««<><,  k2, 


It  seems  to  us  that  for  appropriate  choice  of  k_i5  r(kp  k,)  is 
a useful  index  of  the  average  speed  of  the  iterative  process  for 
solving  the  system  Ax  ~ b.  For  the  modified  gradient  processes  with 
0 < & < 2S  one  always  has  0 = r(k^$  k?)  < 10  One  must  remember^, 
however^,  that  the  time  of  convergence  of  an  iteration  varies  linearly 
with  log(l/r)5  not  with  rj  see  (lU)  and  Table  U® 

The  quantity  r(0$s)  would  measure  the  average  reduction  in 
f(x)  from,  the  beginning  x~  to  the  end  x,  of  a run*,  Since  in  some 

V S 

runs  a substantial  portion  of  the  reduction  from  f(xn)  to  f(x  ) occurs 

vJ  3 

in  the  first  few  steps*,  where  x;,  has  not  yet  settled  down  (see  Table  1) 
the  quantity  r(0p  s)  gives  a deceptively  high  impression  of  the  speed 
of  an  iteration*,  To  avoid  this  initial  effect^  we  selected  k.  « 


and  accordingly  have  tabulated  r(55  s)  in  column  6 as  a measure  of 
the  mean  speed  of  the  process  over  the  major  portion  of  the  runG 
Column  7 s In  certain  of  the  processes  the  quantity 
r(k“l5  k)  ® f(%^  1)/f(xk)  settles  down  and  appears  to  approach  a 
limit o (The  conjecture  (11)  would  imply  that,  for  all  x^ 


f(x^  1)/f(xk)  approaches  a limit  in  the  optimum  gradient  process. 
Since  this  limit  would  be  a measure  of  the  ultimate  speed  of  the. 


process^  ws  give  the  last  value s r ( s-1,  s)  ^ f(xe 


x )«  in 


column  7«  In  other  processes,,  where  the  ratio  f (x^_^)/f (x^.)  does 
not  settle  down,  we  make  no  entry c 


For  comparison  with  Table  we  present  in  Table  3 a summary 


of  Stein  “s  runs  [11]  in.  the  same  form.. 
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TABLE  2 

Summary  of  Our  Runs 


(1)  (2)  (3)  ft)  (5)  (6)  (?) 


MATRIX 

START 

<3 

METHOD 

HD.  OF  STEPS 

r(s“l£s) 

(s) 

w 

o 

a 

o 

© 

lc0 

1+  Steps  and  accelerate 

76 

,8226 

B0S  c0 

© 

1,0 

•J  88  19  88 

65 

.8002 

98 

98 

88 

g 86  90  6! 

UU 

,7552 

l( 

88 

It 

CJ  88  98  88 

k9 

,833U 

88 

w 

83 

12  " 18  88 

80 

,8295 

88 

•y.  (l) 
-n 

10 

g 88  88  88 

83 

,8379 

88 

Z(2) 

30 

g 88  88  98 

$k 

,7717 

1(3) 

u 

2*0 

Straight  Run 

77 

1,0000 

98 

59 

lyl 

80  58 

119 

,9775 

,9786 

88 

88 

1,0 

00  88 

70 

,9733 

o97U8 

88 

08 

,9 

08  88 

8? 

,82014, 

88 

99 

2.0 

8 Steps  and  accelerate 

78 

,87ii9 

88 

00 

1,1 

88  88  88 

79 

,7873 

89 

80 

1.0 

00  88  08 

119 

o62U5 

88 

.y  (U) 

<?9 

Straight  Run 

75 

,8,150 

08 

08 

1.0 

88  80 

53 

o9758 

88 

, 6) 
*0 

*9 

88  99 

85 

,8310 

89 

98 

1,0 

88  88 

ho 

,9293 

,9693 

B25  ® 

(6) 

*0 

1.0 

8 Steps  and  accelerate 

55 

,U566 

88 

89 

80 

Straight  Run 

lh 

98836 

,8939 

88 

88 

,9 

18  8! 

73 

,653© 

Bgg  © 

x (7) 
x0 

1,0 

8 Steps  and  accelerate 

11? 

,14738 

89 

18 

1,0 

Straight  Run 

69 

,8809 

08917 

88 

89 

*9 

IJ  88 

71 

o?H7 

„ (8) 

1.0 

8 Steps  and  accelerate 

123 

0U373 

88 

18 

10 

Straight  Run 

73 

,8903 

08938 

88 

— 

88 

o/ 

88  18 

71 

,6333 

Notes  is  similar  to  B^e  B^5  © with  ^ is  same  problem  as  B^  cf)  with  ©„ 


TABLE  3 


Summary  of  Stein's  Runs 


(1) 

(2) 

(3) 

(U)  . 

(5)  . 

(6) 

(7) 

MATRIX 

START 

$ 

METHOD 

HO.  STEPS 

4 

vn 

Vs 

m 

r(s-l.s) 

B . c 
cr  o 

e 

cl 

Straight  Runs 

30 

. 933k 

88r 

88' 

.3 

381 

30 

o9$09 

«S( 

18! 

.6 

88: 

30 

.9339 

89! 

88 

08 

88: 

30 

.8685 

18 

88: 

« 

CO 

vn 

88; 

30 

.9156 

891 

98 

.9 

83 

30 

.80 65 

88 

81 

.95 

n 

30 

c955U 

88! 

98 

1.0 

88i 

30 

.9702 

.9752: 

88 

381 

1.1 

88: 

30 

. 9737 

.9779 

89 

88: 

1.3 

88: 

30 

.9739 

.9779 

88: 

88: 

1.6 

88: 

30 

e 97  2.8 

.9779 

88! 

18; 

1.9 

! ..  .3s' ; 

30 

.9385 

.9755 
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It  may  be  useful  occasionally  to  transform  r ® r(k^ s k?) 
to  a unit  which  is  proportional  to  the  time  spent  in  an  iteration* 
Let  K ” K(r)  be  the  number  of  steps  needed  to  reduce  jAx^.  - bj 
by  one  decimal  place j i.e.5  to  one-tenth  of  its  value ^ when 


2. 

f(x^i)/f  (3^)  has  the  mean  value  r.  Since  f(x^)  ® |Ax^  - b | s 
find  K(r)  from  the  relation 


we 


r 


,K(r: 


whence 


(1U) 


10 


2 


-2 


K'r^  * log^0(l/r) 


In  Table  k we  give  K(r)  for  some  values  of  r.  Note  that  a small 
variation  in  r near  1 makes  a large  difference  in  K(r)j  the  quantity 
K is  an  approximate  measure  of  the  time  required  to  solve  a system* 

I?.  STUDY  OF  THE  DATA 


The  first  question  we  faced  was;  at  what  intervals  should  the 
acceleration  step  (13)  best  be  applied  to  the  optimum  gradient  method 
(6)?  There  seem  to  be  two  possibilities o (i)  Some  property  of 
the  sequence  ^ might  indicate  when  it  was  ready  to  be  accelerated 
for  example 5 the  property  of  having  ^-l5  a"'jnos^  ^n 

a 2-plane . To  simplify  the  I0B*M*  procedure  this  was  not  tried, 

(ii)  The  acceleration  step  could  be  inserted  after  each  m steps* 

The  latter  procedure  was  adopted^,  and  we  needed  to  select  a preferred 


value  of  i» 
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TABI E ii 


Iterative  Steps  Per 

1-Decimal  Reduction  of  Jibe  - b | 

r~ 

"“Wj 

.01 

X.O 

■ eX. 

2*0 

.2 

2.9 

.3 

3.8 

oil 

5.x 

.5 

6 © 6 

» 6 

9.0 

13  oO 

.8 

20  e 6 

33.0 

*90 

13.7 

„9X 

ii8„8 

*92 

55  o3 

o93 

63.5 

*9U 

7 ii.ii 

o 95 

89 .8 

*96 

3X3 

. 97 

X6Il 

*98 

227 

©98.5 

305 

*990 

U58 

o99X 

509 

.992 

57ii 

o993 

656 

.994 

765 

e995 

9X9 

.996 

XXii9 

.997 

1533 

*998 

2300 

.999 

U603 

2k 


For  the  matrix  B we  made  several  tests  designed  to  choose  m„ 
u 

In  one*  starting  always  with  a vector  near  the  of  Table  we 

repeatedly  ran  12  optimum  gradient  steps  and  an  acceleration  step. 

These  thirteen  steps  were  performed  in  10  different  ways5  viz  with 

the  acceleration  step  following  the  2nd^  3rda  ° * °9  or  11th  optimum 

gradient  step.  The  results  of  the  test  (not  shown  in  this  paper) 

were  that  one  got  to  the  least  f(x)  in  these  thirteen  steps  when  the 

acceleration  was  taken  as  late  as  possible . This  fact  by  itself 

suggests  taking  a large  value  of  m©  On  the  other  hand<,  use  of  a 

large  m means  that  relatively  fewer  accelerations  can  be  taken  in 

a run  of  a given  number  of  steps©  Since  the  main  reduction  in  f(x) 

comes  in  the  accelerations^  one  wants  as  many  such  steps  as  possible. 

The  balance  between  these  opposing  factors  determines  the  best 

value  for  m„  For  the  matrix  we  tried  out  m ® 7 9 8#  9p  and 

12.  Study  of  the  values  of  r(55s)  for  the  runs  with  B^  at  the 

start,  of  Table  2 indicates  that  (for  x ® ■9)  m “ 8 is  best,,  while 

u 

m « 7 is  next  besto 

It  is  probable  that  the  optimal  v alue  of  m depends  on  the  value 
of  the  initial  vector  Xq.  Me  suspect^,  however ^ that  the  variation 
of  m with  Xq  is  commonly  slight^  because  each  acceleration  is  a comp- 
licated transformation  which  effectively  produces  a new  initial  vector. 
Thus  each  single  run  is  the  average  of  a.  number  of  different  starts. 
With  matrices  B^  and  the  procedure  with  m ® 8 was  run  for  more 
than  one  start  x^®  The  variation  in  r(5$s)  is  not  great j see  Table  2* 
Thai  m depends  materially  on  the  matrix  B and  its  order  m is  not 
questionedf;  it  would  be  an  interesting  experiment  to  study  the  depend- 


ence on  these  factors 
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Hairing  accepted  m = 8 as  the  best  value  for  the  matrix 
(and  for  the  similar  matrix  B ^)s  we  retained  m ® 8 for  the  matrix 
Bp  but  we  do  not  claim  it  to  be  optimalo 

Table  1 shows  a run  of  11,8  steps  with  matrix  Bp  using  m - 8 

(3) 

and,  starting  with  the  vector  x^  s which  corresponds  to  the  start 
0 for  Bq  used  above  and  in  [11] „ The  table  of  values  of  f(x^)  and 
their  ratios  shows  what  a complicated  process  we  are  dealing  with® 

Ey  a cycle  we  refer  to  a block  of  eight  optimum  gradient  steps  follow- 
ed by  one  acceleration®  The  cycles  vary  greatly  in  their  success 
in  reducing  f(x^)o  When  the  optimum  gradient  steps  within  a cycle 
bring  x^.  - A b nearly  into  the  plane  Tf  defined  in  section  2$  then 
the  following  acceleration  brings  f(%^+^)  to  a value  much  less  than 
f(XjP©  The  efficiency  of  the  next  following  cycles  varies  within 
wide  bounds^,  apparently  according  to  rather  subtle  properties  of 
the  vector  x^+^  "^)«  Thus  the  progress  of  the  error  vector 

to  zero  is  irregular^  and  the  values  of  r in  Table  1 are  certainly 
not  predictable®  Part  of  the  reason  for  this  is  thats  as  can  be 
shown,,  the  asymptotic  behavior  of  the  direction  vectors 
(\  - if  ^b)/  - Amlb  | is  exceedingly  sensitive  to  a change  in 

the  initial  vector  xQO 

Nevertheless  - and  this  is  the  important  practical  consideration  - 
in  the  accelerated  procedure  x^  was  able  to  progress  rapidly  toward 
the  solution  A For  the  matrix  Bp  the  average  reduction  r( 5^119) 

in  f(x)  was  *6215  (Table  2)s  representing  a speed  of  convergence  18 
times  as  fast  as  that  of  the  optimum  gradient  method,,  for  which 
r(86„87)  s .9?180  (ieec  (©9718)1®  ■ » 


6215 «)  For  the  matrix  B^s 


the 
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corresponding  figures  for  starts  :xrj ' \ xQ^  were  *>5566 

and  „ 893%,  #14-738  and  #89X7,,  .5373  and  #8938°’  here  the  speed  was 
7 or  8 times  that  of  the  optimum  gradient  method# 

It  seems  impossible  to  say  how  the  improvement  would  behave 
with  larger  n# 

The  irregular  decrease  of  f(.^)  was  observed  also  by  Stein 
[11]  for  the  almost  optimum  methods  ( § & 0#9)e  The  gradient  methods 
for  @ i 1 are  slow  to  converge  because  the  sequence  {\ " 4-1  b] 
approaches  periodicity#  On  the  other  hand*,  the  strongly  non-periodic 
character  of  our  accelerated  process  and  of  the  almost  optimum  process 
is5  we  believe,,  at  the  root  of  their  success j the  vectors  - k 8 
are  not  permitted  to  settle  down  into  a periodicity. 

We  have  in  Table  2 a few  data  which  afford  a comparison  of  the 

accelerated  gradient  process  m « 8 with  the  almost  optimum,  method 

$ » 0®9o  The  comparison  should  be  regarded  as  only  tentative^,  since 

we  have  no  idea  how  the  speed  of  either  process  varies  with  the  order 

(3) 

of  the  matrices  or  other  factors#  For  the  matrix:  B.p  and  start  s 

we  find  r(5587)  ~ #8205  for  87  steps  of  the  almost  optimum  process 
( ft  » 009)'%  while  for  the  accelerated  process  (m  ® 8)  we  have 

r(5g  119)  * *6255°  For  B^s  Gs  and  starts  xq  ^s>  the 

comparable  figures  ares  #6.530  and  « U5 66 3 #7117  and  #5738j  #6333 
and  * 5373.  If  these  figures  are  representative^,  the  accelerated 
scheme  ( m s 8)  is  about  twice  as  fast  as  the  almost-optiraum  method 
( $ s 0.9)*  Cm  the  other  hand,,  the  latter  process  surely  requires 


'■’'For  the  similar  matrix  BQ^  Stein’s  runs  yield  r(5*30)  * #8065  (see 
Table  3)|  the  irregularity  of  the  process  accounts  for  the  discrepancy* 
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a simpler  and  shorter  code  - an  exceedingly  important  advantage  with 
machines o 

In  the  straight  nans  with  and  of  the  optimum  gradient 
method  ( ft  ® i)  .in  Table  2 we  have  several  samples  of  the  ratio 
r(s”l5  s)5  which  appears  to  be  near  its  limit e According  to  (8)p 
this  ratio  cannot  exceed  p 0 Now 

for  Bq  and  B^  p'~  s „ 97 865 83 2 $ 

for  Bg,  p - e89li8l373  » 

We  find  that,,  of  the  six  values  of  r{s“l5s)s  all  are  greater  than 

9 

99  per  cent  of  their'  maximum  value  jx~$  while  five  exceed  99  06  per 
cento  This  confirms  the  statement  made  in  section  II  that  Kantovorichls 
inequality  (9)  is  close  to  an  equality  in  practice c 

In  our  one  long  run  made  with  § > X5  we  observe  that- 
r(s=-l5s)  * 0978665,  which  agrees  with  to  five  decimals 0 We  also 
observe  thatp  in  Stein 5 s four  runs  with  (3  > 15  three  of  them  have 
r(s“lsS')  = <,9779  © We  thus  observe  for  n ® 6 an  apparent  behavior 
of  the  modified  gradient  method  (X<^<  2)  which  is  like  that  remark” 
ed  in  section  II  to  be  universally  true  for  ri  « 2%  for  most  xQ  and 
for  {$  in  a range  1 < ft  & ^ < 2S  one  has  f (x^)/f  (x^_^)  ^ as 


k «n 
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